Date Created March 2023
To answer this question, I am going to talk a bit about myself. I am a recent graduate in Business Analytics or Data Science in general, and I would like to kick off the process of hunting jobs, therefore, it is wise to find out more about salary for salary expectation and negotiation. Another reason is part of my curiosity as an ex-payoll specialist to understand current compensation trend, how rewarding for a senior position and much more.
All in all, this project aims to:
-Provide historical salary information to new graduates
-Supply salary range between different lelvels to current employees
-Encourage data nerds on their journey by seeing the potential of salary increment.
So, let's dive in!
I downloaded the dataset from Kaggle for this analysis
#Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.express as px
#Import dataset
df=pd.read_csv(r"C:\Users\huong\OneDrive\Máy tính\4. JOB RELATED MATTERS\2. Python\Project 2 - Portfolio\ds_salaries.csv")
df
| Unnamed: 0 | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2020 | MI | FT | Data Scientist | 70000 | EUR | 79833 | DE | 0 | DE | L |
| 1 | 1 | 2020 | SE | FT | Machine Learning Scientist | 260000 | USD | 260000 | JP | 0 | JP | S |
| 2 | 2 | 2020 | SE | FT | Big Data Engineer | 85000 | GBP | 109024 | GB | 50 | GB | M |
| 3 | 3 | 2020 | MI | FT | Product Data Analyst | 20000 | USD | 20000 | HN | 0 | HN | S |
| 4 | 4 | 2020 | SE | FT | Machine Learning Engineer | 150000 | USD | 150000 | US | 50 | US | L |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 602 | 602 | 2022 | SE | FT | Data Engineer | 154000 | USD | 154000 | US | 100 | US | M |
| 603 | 603 | 2022 | SE | FT | Data Engineer | 126000 | USD | 126000 | US | 100 | US | M |
| 604 | 604 | 2022 | SE | FT | Data Analyst | 129000 | USD | 129000 | US | 0 | US | M |
| 605 | 605 | 2022 | SE | FT | Data Analyst | 150000 | USD | 150000 | US | 100 | US | M |
| 606 | 606 | 2022 | MI | FT | AI Scientist | 200000 | USD | 200000 | IN | 100 | US | L |
607 rows × 12 columns
#Have a glance
df.shape
(607, 12)
#1.Drop the following columns: unnamed:0, salary and salary_currency as they are redundant information.
#We have colum salary_in_usd which carries salary infotamation in one currency which will be useful.
df.drop(['Unnamed: 0','salary_currency','salary'], axis="columns", inplace=True)
df
| work_year | experience_level | employment_type | job_title | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | MI | FT | Data Scientist | 79833 | DE | 0 | DE | L |
| 1 | 2020 | SE | FT | Machine Learning Scientist | 260000 | JP | 0 | JP | S |
| 2 | 2020 | SE | FT | Big Data Engineer | 109024 | GB | 50 | GB | M |
| 3 | 2020 | MI | FT | Product Data Analyst | 20000 | HN | 0 | HN | S |
| 4 | 2020 | SE | FT | Machine Learning Engineer | 150000 | US | 50 | US | L |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 602 | 2022 | SE | FT | Data Engineer | 154000 | US | 100 | US | M |
| 603 | 2022 | SE | FT | Data Engineer | 126000 | US | 100 | US | M |
| 604 | 2022 | SE | FT | Data Analyst | 129000 | US | 0 | US | M |
| 605 | 2022 | SE | FT | Data Analyst | 150000 | US | 100 | US | M |
| 606 | 2022 | MI | FT | AI Scientist | 200000 | IN | 100 | US | L |
607 rows × 9 columns
#2.Check if any missing values - No missing values at all!
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 607 entries, 0 to 606 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 work_year 607 non-null int64 1 experience_level 607 non-null object 2 employment_type 607 non-null object 3 job_title 607 non-null object 4 salary_in_usd 607 non-null int64 5 employee_residence 607 non-null object 6 remote_ratio 607 non-null int64 7 company_location 607 non-null object 8 company_size 607 non-null object dtypes: int64(3), object(6) memory usage: 42.8+ KB
#3.Check outliers using boxplot
#There is one observation of 600,000USD package for a full-time Principle Data Engineer
#This datapoint is far from others but it is possible scenario on the market in 2021
#Hence we will keep this outlier
df.boxplot(column=["salary_in_usd"])
<AxesSubplot:>
#4.Check and remove duplicate
#There are no duplicates - we still have 607 observations
df['salary_in_usd'].drop_duplicates().sort_values(ascending=True)
df.drop_duplicates()
| work_year | experience_level | employment_type | job_title | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | MI | FT | Data Scientist | 79833 | DE | 0 | DE | L |
| 1 | 2020 | SE | FT | Machine Learning Scientist | 260000 | JP | 0 | JP | S |
| 2 | 2020 | SE | FT | Big Data Engineer | 109024 | GB | 50 | GB | M |
| 3 | 2020 | MI | FT | Product Data Analyst | 20000 | HN | 0 | HN | S |
| 4 | 2020 | SE | FT | Machine Learning Engineer | 150000 | US | 50 | US | L |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 602 | 2022 | SE | FT | Data Engineer | 154000 | US | 100 | US | M |
| 603 | 2022 | SE | FT | Data Engineer | 126000 | US | 100 | US | M |
| 604 | 2022 | SE | FT | Data Analyst | 129000 | US | 0 | US | M |
| 605 | 2022 | SE | FT | Data Analyst | 150000 | US | 100 | US | M |
| 606 | 2022 | MI | FT | AI Scientist | 200000 | IN | 100 | US | L |
565 rows × 9 columns
#Visual 1: which roles have been most popular in the field?
df1=df['job_title'].value_counts().head(10)
df1
Data Scientist 143 Data Engineer 132 Data Analyst 97 Machine Learning Engineer 41 Research Scientist 16 Data Science Manager 12 Data Architect 11 Big Data Engineer 8 Machine Learning Scientist 8 Principal Data Scientist 7 Name: job_title, dtype: int64
fig=px.bar(df1,x=df1.index,y=df1.values, text=df1.values,labels={'index':'Job Title','y':'Count'},title='<b>Top 10 popular roles in Data Science')
fig.show()
#Insight: Data Scientist is the most popular role in last three years. If anyone targets this role, then the competition would be less since it is highly likely to have more job listings.
#Data Analyst, which is the role that I am targeting, is also in demand, which is a good sign.
#Visual 2: Which roles attract the highest compensation package?
df2=df.sort_values(by='salary_in_usd',ascending=False).head(10)
df2
| work_year | experience_level | employment_type | job_title | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | |
|---|---|---|---|---|---|---|---|---|---|
| 252 | 2021 | EX | FT | Principal Data Engineer | 600000 | US | 100 | US | L |
| 97 | 2021 | MI | FT | Financial Data Analyst | 450000 | US | 100 | US | L |
| 33 | 2020 | MI | FT | Research Scientist | 450000 | US | 0 | US | M |
| 157 | 2021 | MI | FT | Applied Machine Learning Scientist | 423000 | US | 50 | US | L |
| 225 | 2021 | EX | CT | Principal Data Scientist | 416000 | US | 100 | US | S |
| 63 | 2020 | SE | FT | Data Scientist | 412000 | US | 100 | US | L |
| 523 | 2022 | SE | FT | Data Analytics Lead | 405000 | US | 100 | US | L |
| 519 | 2022 | SE | FT | Applied Data Scientist | 380000 | US | 100 | US | L |
| 25 | 2020 | EX | FT | Director of Data Science | 325000 | US | 100 | US | L |
| 482 | 2022 | EX | FT | Data Engineer | 324000 | US | 100 | US | M |
fig=px.bar(df2,x='job_title',y='salary_in_usd',color='job_title',title='<b>Top 10 highest paid roles in Data Science')
fig.show()
#Visual 3: Where on earth has had the highest demand for Data Science people?
df3=df['company_location'].value_counts().head(10)
df3
US 355 GB 47 CA 30 DE 28 IN 24 FR 15 ES 14 GR 11 JP 6 NL 4 Name: company_location, dtype: int64
fig=px.bar(df3,x=df3.index,y=df3.values,text=df3.values, labels={'index':"Country",'y':'Count'},title='<b>Top 10 countries that have more jobs in Data Science')
fig.show()
#Insight: The US is certainly a tech hub where it created the far highest number of data science jobs, compared to others.
#Relocation and remote working for US companies would be ideal for those who are looking for jobs. It does not always have to base in one location.
#Visual 4: Is the trend of remote working in Data Science a hot topic?
df4=df['remote_ratio'].value_counts().head(10)
df4
100 381 0 127 50 99 Name: remote_ratio, dtype: int64
fig=px.bar(df4,x=df4.index,y=df4.values,labels={'index':'Remote Work Ratio','y':'Count'},title='<b>The trend of working from home in Data Science')
fig.show()
#Insights: So this bar chart confirms that remote working option has been the fashion to attract talents, given the number of jobs providing this flexibility
#Visual 5: Did we have lots of jobs for Entry Level people? New graduates badly need this information.
fig=px.pie(df.groupby('experience_level',as_index=False)['salary_in_usd'].count()
,names='experience_level',values='salary_in_usd',color='experience_level'
,color_discrete_sequence=px.colors.sequential.RdBu
,title='<b>Percentage of jobs based on experience level')
fig.show()
#Insights: We did not have a high percentage of jobs for Entry Level or Junior people, compared to Mid-level or Senior.
#This is a bad news, however, it is good to keep in mind.
#People who is new to data should always upgrade their kills to climb up, to reduce the competition.
#Visual 6: A closer look at median, mean, min, max salary for each level.
px.box(df,x='experience_level',y='salary_in_usd'
,color='experience_level'
,template='ggplot2'
,labels={'experience_level':'Experience Level','salary_in_usd':'salary in usd'}
,title='<b>Data Science salaries by experince level')
#Visual 7: So what are the employment types? Which one is more popular?
fig=px.pie(df.groupby('employment_type',as_index=False)['salary_in_usd'].count()
,names='employment_type',values='salary_in_usd',color='employment_type'
,hole=0.5
,title='<b>Percentage of jobs based on employment type')
fig.show()
#Insights: surprisingly, full-time based employees are seeked the most.It seems companies who employed would prefer to build a team that is working full-time hours rather than part-time hours or contract-based.
#Visual 8: How is the size of employing companies? How big are they to afford to have data-related employees on board?
fig=px.pie(df.groupby('company_size',as_index=False)['salary_in_usd'].count()
,names='company_size',values='salary_in_usd'
,hole=0.7
,title='<b> Percentage of companies by size')
fig.show()
#Insights: So, to create a team of data analytics, the company tend to be from medium to large rather than in small scale.
#Visual 9: more statistics about salary by the years
px.violin(df,x='work_year',y='salary_in_usd',color='work_year',labels={'work_year':'year','salary_in_usd':'salary in usd'},template='seaborn',title='<b>Data Science Salaries by year')
#Visual 10: Now, how has the average salary changed over years, expecially when pandemic hit?
x=['2020','2021','2022']
fig=px.line(df,x,y=df.groupby('work_year')['salary_in_usd'].mean()
,labels={'x':'Year','y':"Average Salary"}
,title="<b>Average salary of Data Science from 2020 to 2022")
fig.show()
#Insight: wow, there was a significant increment for average salary from 2021 to 2022, from around $100k to $122k per year.
#This is a huge and promising for data-nerds.
#Reference from https://www.coursera.org/articles/data-scientist-salary, salary is expected to be $125k in 2023 in the US
So to conclude this analysis, here are the takeaways or insights extracted from the dataset:
1.US is the country that has had the highest number of jobs.
2.Full-time employees could maximize their earnings more than Part-time or Contract-based.
3.Data Scientist is the most seeked role in Data Field in the past three years.
4.Remote working flexibility has continued to be the fashion.
5.There is a huge increment in average salary from 2020 to 2022, which is expected to rise up to $125k in 2023 according to Courera.
I hope these facts are helpful for anyone who is seeking for first job in 2023 or struggling to break in the field, or feeling demotivated by how much effort to put in to be employed.
Thank you!